Current Issue: July - September | Volume: 2015 | Issue: 3 | Articles: 5
The spatio-temporal-prediction (STP) method for multichannel speech enhancement has recently been proposed. This approach makes it theoretically possible to attenuate the residual noise without distorting speech. In addition, the STP method depends only on second-order statistics and can be implemented in a simple linear filtering framework. Unfortunately, numerical problems can arise when estimating the filter matrix in transients: in such cases the speech correlation matrix is usually rank deficient, so that no solution exists. In this paper, we propose to implement the spatio-temporal-prediction method using a signal subspace approach. This allows for nullifying the noise subspace and processing only the noisy signal in the signal-plus-noise subspace. As a result, we are able not only to regularize the solution in transients but also to achieve higher attenuation of the residual noise. The experimental results also show that the signal subspace approach distorts speech less than the conventional method.
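The subspace idea in this abstract can be illustrated with a minimal numerical sketch. The function name, dimensions, and the bare projection step below are my own illustrative choices, not the authors' exact STP filter:

```python
import numpy as np

def subspace_regularized_projector(R_noisy, R_noise, rank):
    """Sketch of a signal-subspace regularization step: estimate the
    speech correlation matrix from second-order statistics, keep only
    its `rank` leading eigen-directions, and build a projector that
    nullifies the noise subspace (hypothetical helper)."""
    R_speech = R_noisy - R_noise           # second-order statistics only
    w, V = np.linalg.eigh(R_speech)        # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:rank]       # leading (signal) directions
    Vs = V[:, idx]
    # Project onto the signal-plus-noise subspace; the orthogonal
    # complement (noise subspace) is nullified, which regularizes
    # rank-deficient transients where a full-rank inverse would fail.
    return Vs @ Vs.T

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 3))
R_noisy = A @ A.T + 0.1 * np.eye(8)   # rank-3 speech + white noise
R_noise = 0.1 * np.eye(8)
P = subspace_regularized_projector(R_noisy, R_noise, rank=3)
```

Because `P` is an orthogonal projector of rank 3, applying it twice is the same as applying it once, and any component outside the signal subspace is mapped to zero.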
Automatic forensic voice comparison (FVC) systems employed in forensic casework have often relied on Gaussian Mixture Model - Universal Background Models (GMM-UBMs) for modelling, with relatively little research into supervector-based approaches. This paper reports on a comparative study which investigates the effectiveness of multiple approaches operating on GMM mean supervectors, including support vector machines and various forms of regression. First, we demonstrate a method by which supervector regression can be used to produce a forensic likelihood ratio. Then, three variants of solving the regression problem are considered, namely least squares and ℓ1- and ℓ2-norm minimization solutions. Comparative analysis of these techniques, combined with four different scoring methods, reveals that supervector regression can provide a substantial relative improvement in both validity (up to 75.3%) and reliability (up to 41.5%) over both GMM-UBM and Gaussian Mixture Model - Support Vector Machine (GMM-SVM) results when evaluated on two studio-clean forensic speech databases. Under mismatched/noisy conditions, more modest relative improvements in both validity (up to 41.5%) and reliability (up to 12.1%) were obtained relative to GMM-SVM results. From a practical standpoint, the analysis also demonstrates that supervector regression can be more effective than GMM-UBM or GMM-SVM in obtaining a higher positive-valued likelihood ratio for same-speaker comparisons, thus improving the strength of evidence if the particular suspect on trial is indeed the offender. Based on these results, we recommend least squares as the better-performing regression technique, with gradient projection as another promising technique specifically for applications typical of forensic casework.
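The least-squares variant named in this abstract can be sketched generically. The variable names, the ridge regularizer `lam`, and the Gaussian score-to-LR calibration below are my own assumptions (the paper itself compares four scoring methods), shown only to make the pipeline concrete:

```python
import numpy as np

def train_ls_regressor(X, y, lam=1e-3):
    """Ridge-regularized least-squares regression on GMM mean
    supervectors: solve (X'X + lam*I) w = X'y. The regularizer is a
    hypothetical stability choice, not taken from the paper."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def score_to_llr(score, same_mean, diff_mean, sigma):
    """Map a raw regression score to a log-likelihood ratio assuming
    Gaussian score distributions for same-speaker and
    different-speaker comparisons (one common calibration choice)."""
    def logpdf(x, m):
        return -0.5 * ((x - m) / sigma) ** 2
    return logpdf(score, same_mean) - logpdf(score, diff_mean)

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 5))              # 20 toy supervectors
y = np.where(np.arange(20) < 10, 1.0, -1.0)   # suspect vs. background
w = train_ls_regressor(X, y)
s = float(X[0] @ w)                           # score one comparison
llr = score_to_llr(s, same_mean=1.0, diff_mean=-1.0, sigma=1.0)
```

A positive log-likelihood ratio favours the same-speaker hypothesis; the calibration parameters would in practice be estimated from held-out development scores.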
Automatic diagnosis and monitoring of Alzheimer's disease can have a significant impact on society as well as the well-being of patients. The part of the brain cortex that processes language abilities is one of the earliest parts to be affected by the disease. Therefore, detection of Alzheimer's disease using speech-based features is gaining increasing attention. Here, we investigated an extensive set of features based on speech prosody as well as linguistic features derived from transcriptions of Turkish conversations with subjects with and without Alzheimer's disease. Unlike most standardized tests that focus on memory recall or structured conversations, spontaneous unstructured conversations were conducted with the subjects in informal settings. Age-, education-, and gender-controlled experiments were performed to eliminate the effects of those three variables. Experimental results show that the proposed features extracted from the speech signal can be used to discriminate between the control group and the patients with Alzheimer's disease. Prosodic features performed significantly better than the linguistic features. Classification accuracy over 80% was obtained with three of the prosodic features, but experiments with feature fusion did not further improve the classification performance.
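To make the feature family concrete, here is a sketch of two generic prosodic measures (pause ratio and energy variability) computed from frame energies. These are illustrative examples of the kind of feature the abstract describes, not the paper's actual feature set, and all names and thresholds are my own:

```python
import numpy as np

def prosodic_features(signal, sr, frame_ms=25, hop_ms=10, thresh=0.02):
    """Two illustrative prosodic features from frame-level RMS energy:
    pause_ratio  -- fraction of frames below an energy threshold,
    energy_cv    -- coefficient of variation of frame energy."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    frames = [signal[i:i + frame] for i in range(0, len(signal) - frame, hop)]
    energy = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    pause_ratio = float(np.mean(energy < thresh))
    energy_cv = float(np.std(energy) / (np.mean(energy) + 1e-12))
    return pause_ratio, energy_cv

sr = 16000
t = np.arange(sr) / sr
speech = 0.5 * np.sin(2 * np.pi * 220 * t)  # 1 s synthetic "voiced" tone
speech[: sr // 4] = 0.0                     # a leading quarter-second pause
pr, cv = prosodic_features(speech, sr)      # pr reflects the 25% pause
```

Features like these would then feed a standard classifier for the control-versus-patient discrimination task described above.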
Music identification via audio fingerprinting has been an active research field in recent years. In real-world environments, music queries are often deformed by various interferences, which typically include signal distortions and time-frequency misalignments caused by time stretching, pitch shifting, etc. Therefore, robustness plays a crucial role in music identification techniques. In this paper, we propose to use scale-invariant feature transform (SIFT) local descriptors computed from a spectrogram image as sub-fingerprints for music identification. Experiments show that these sub-fingerprints exhibit strong robustness against severe time stretching and pitch shifting simultaneously. In addition, a locality-sensitive hashing (LSH)-based nearest sub-fingerprint retrieval method and a matching determination mechanism are applied for robust sub-fingerprint matching, which makes the identification efficient and precise. Finally, as an auxiliary function, we demonstrate that by comparing the time-frequency locations of corresponding SIFT keypoints, the factors of time stretching and pitch shifting that music queries may have undergone can be accurately estimated.
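The auxiliary estimation step mentioned at the end of this abstract can be sketched directly: given matched keypoint locations on the two spectrogram images, the stretch and shift factors fall out of coordinate ratios. The function name and the use of a median (for robustness to outlier matches) are my own choices:

```python
import numpy as np

def estimate_distortion(q_pts, r_pts):
    """Estimate time-stretch and pitch-shift factors from matched
    keypoint locations: compare the (time, frequency) coordinates of
    corresponding spectrogram keypoints in the query and reference.
    The median makes the estimate robust to a few wrong matches."""
    q = np.asarray(q_pts, float)   # (N, 2): (time, frequency) in query
    r = np.asarray(r_pts, float)   # (N, 2): matched reference keypoints
    stretch = np.median(q[:, 0] / r[:, 0])   # > 1 means slowed down
    shift = np.median(q[:, 1] / r[:, 1])     # > 1 means pitched up
    return float(stretch), float(shift)

# Query stretched by 1.2x in time and pitched up by 1.05x:
ref = np.array([[1.0, 100.0], [2.0, 220.0], [3.5, 440.0]])
qry = ref * np.array([1.2, 1.05])
s, p = estimate_distortion(qry, ref)   # ≈ (1.2, 1.05)
```

In a full system the matched pairs would come from the LSH-based sub-fingerprint retrieval stage rather than being constructed by hand as here.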
This paper presents a voice conversion (VC) method that utilizes conditional restricted Boltzmann machines (CRBMs) for each speaker to obtain high-order, speaker-independent spaces where voice features are converted more easily than in the original acoustic feature space. The CRBM is expected to automatically discover common features lurking in time-series data. When we train two CRBMs for a source and a target speaker independently, using only speaker-dependent training data, each CRBM can be considered to construct a subspace with less phonetic information and relatively more speaker individuality than the original acoustic space, because the training data contain various phonemes while the speaker individuality remains unchanged. The obtained high-order features are then connected from the source to the target using a neural network (NN). The entire network (the two CRBMs and the NN) can also be fine-tuned as a recurrent neural network (RNN) using parallel acoustic data, since both the CRBMs and the connecting NN have network-based representations with time dependencies. Through voice-conversion experiments, we confirmed the high performance of our method, especially in terms of objective evaluation, comparing it with conventional GMM, NN, RNN, and our previous speaker-dependent DBN approaches.
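The three-stage chain described here (source encoder, connecting NN, target decoder) can be shown in shape-only form. Real CRBMs condition on past frames and are trained by contrastive divergence; the sketch below replaces each stage with a single random-weight layer purely to show how the pieces compose, and every name in it is my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ConversionPipeline:
    """Shape-only sketch of the conversion chain: source encoder ->
    connecting NN -> target decoder. Untrained random weights; a real
    system would train the CRBMs per speaker and fine-tune the whole
    chain on parallel data as the abstract describes."""
    def __init__(self, d_feat, d_hid, rng):
        self.W_src = rng.standard_normal((d_hid, d_feat)) * 0.1  # encoder
        self.W_nn = rng.standard_normal((d_hid, d_hid)) * 0.1    # connector
        self.W_tgt = rng.standard_normal((d_feat, d_hid)) * 0.1  # decoder

    def convert(self, x):
        h_src = sigmoid(self.W_src @ x)     # source high-order space
        h_tgt = sigmoid(self.W_nn @ h_src)  # mapped to target space
        return self.W_tgt @ h_tgt           # back to acoustic features

rng = np.random.default_rng(2)
pipe = ConversionPipeline(d_feat=24, d_hid=16, rng=rng)
y = pipe.convert(rng.standard_normal(24))   # one converted frame
```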